{
    "cells": [
        {
            "cell_type": "markdown",
            "id": "23243c2e",
            "metadata": {},
            "source": [
                "# Deploy a vision LLM\n",
                "\n",
                "<div align=\"left\">\n",
                "<a target=\"_blank\" href=\"https://console.anyscale.com/template-preview/deployment-serve-llm?file=%252Ffiles%252Fvision-llm\"><img src=\"https://img.shields.io/badge/🚀 Run_on-Anyscale-9hf\"></a>&nbsp;\n",
                "<a href=\"https://github.com/ray-project/ray/tree/master/doc/source/serve/tutorials/deployment-serve-llm/vision-llm\" role=\"button\"><img src=\"https://img.shields.io/static/v1?label=&amp;message=View%20On%20GitHub&amp;color=586069&amp;logo=github&amp;labelColor=2f363d\"></a>&nbsp;\n",
                "</div>\n",
                "\n",
                "A vision LLM can interpret images as well as text, enabling tasks like answering questions about charts, analyzing photos, or combining visuals with instructions. It extends LLMs beyond language to support multimodal reasoning and richer applications.  \n",
                "\n",
                "This tutorial deploys a vision LLM using Ray Serve LLM.  \n",
                "\n",
                "---\n",
                "\n",
                "## Configure Ray Serve LLM\n",
                "\n",
                "If your model is gated, set your Hugging Face token in the configuration so that Ray Serve LLM can access it.\n",
                "\n",
                "Ray Serve LLM provides multiple [Python APIs](https://docs.ray.io/en/latest/serve/api/index.html#llm-api) for defining your application. Use [`build_openai_app`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.llm.build_openai_app.html#ray.serve.llm.build_openai_app) to build a full application from your [`LLMConfig`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.llm.LLMConfig.html#ray.serve.llm.LLMConfig) object."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "id": "ebc41d60",
            "metadata": {},
            "outputs": [],
            "source": [
                "# serve_qwen_VL.py\n",
                "from ray.serve.llm import LLMConfig, build_openai_app\n",
                "import os\n",
                "\n",
                "llm_config = LLMConfig(\n",
                "    model_loading_config=dict(\n",
                "        model_id=\"my-qwen-VL\",\n",
                "        model_source=\"Qwen/Qwen2.5-VL-7B-Instruct\",\n",
                "    ),\n",
                "    accelerator_type=\"L40S\", # Or \"A100-40G\"\n",
                "    deployment_config=dict(\n",
                "        autoscaling_config=dict(\n",
                "            min_replicas=1,\n",
                "            max_replicas=2,\n",
                "        )\n",
                "    ),\n",
                "    ### Uncomment if your model is gated and needs your Hugging Face token to access it.\n",
                "    # runtime_env=dict(env_vars={\"HF_TOKEN\": os.environ.get(\"HF_TOKEN\")}),\n",
                "    engine_kwargs=dict(max_model_len=8192),\n",
                ")\n",
                "\n",
                "app = build_openai_app({\"llm_configs\": [llm_config]})\n"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "c76a6362",
            "metadata": {},
            "source": [
                "**Note:** Before moving to a production setup, migrate to a [Serve config file](https://docs.ray.io/en/latest/serve/production-guide/config.html) to make your deployment version-controlled, reproducible, and easier to maintain for CI/CD pipelines. See [Serving LLMs - Quickstart Examples: Production Guide](https://docs.ray.io/en/latest/serve/llm/quick-start.html#production-deployment) for an example.\n",
                "\n",
                "---\n",
                "\n",
                "## Deploy locally\n",
                "\n",
                "**Prerequisites**\n",
                "\n",
                "* Access to GPU compute.\n",
                "* (Optional) A **Hugging Face token** if using gated models. Set it as an environment variable with `export HF_TOKEN=<YOUR-TOKEN-HERE>`.\n",
                "\n",
                "**Note:** Depending on the organization, you can usually request access on the model's Hugging Face page. For example, approval for Meta’s Llama models can take anywhere from a few hours to several weeks.\n",
                "\n",
                "**Dependencies:**  \n",
                "```bash\n",
                "pip install \"ray[serve,llm]\"\n",
                "```\n",
                "\n",
                "---\n",
                "\n",
                "### Launch\n",
                "\n",
                "Follow the instructions at [Configure Ray Serve LLM](#configure-ray-serve-llm) to define your app in a Python module `serve_qwen_VL.py`.  \n",
                "\n",
                "In a terminal, run:   "
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "id": "7eb8734c",
            "metadata": {},
            "outputs": [],
            "source": [
                "serve run serve_qwen_VL:app --non-blocking"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "d36f41d1",
            "metadata": {},
            "source": [
                "Deployment typically takes a few minutes as the cluster is provisioned, the vLLM server starts, and the model is downloaded. \n",
                "\n",
                "---\n",
                "\n",
                "### Sending requests with images\n",
                "\n",
                "Your endpoint is available locally at `http://localhost:8000` and you can use a placeholder authentication token for the OpenAI client, for example `\"FAKE_KEY\"`.\n",
                "\n",
                "Example curl with image URL:"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "id": "400e7790",
            "metadata": {},
            "outputs": [],
            "source": [
                "curl -X POST http://localhost:8000/v1/chat/completions \\\n",
                "  -H \"Authorization: Bearer FAKE_KEY\" \\\n",
                "  -H \"Content-Type: application/json\" \\\n",
                "  -d '{ \"model\": \"my-qwen-VL\", \"messages\": [ { \"role\": \"user\", \"content\": [ {\"type\": \"text\", \"text\": \"What do you see in this image?\"}, {\"type\": \"image_url\", \"image_url\": { \"url\": \"http://images.cocodataset.org/val2017/000000039769.jpg\" }} ] } ] }'"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "291743a5",
            "metadata": {},
            "source": [
                "Example Python with image URL:"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "id": "6b447094",
            "metadata": {},
            "outputs": [],
            "source": [
                "# client_url_image.py\n",
                "from urllib.parse import urljoin\n",
                "from openai import OpenAI\n",
                "\n",
                "API_KEY = \"FAKE_KEY\"\n",
                "BASE_URL = \"http://localhost:8000\"\n",
                "\n",
                "client = OpenAI(base_url=urljoin(BASE_URL, \"v1\"), api_key=API_KEY)\n",
                "\n",
                "response = client.chat.completions.create(\n",
                "    model=\"my-qwen-VL\",\n",
                "    messages=[\n",
                "        {\n",
                "            \"role\": \"user\",\n",
                "            \"content\": [\n",
                "                {\"type\": \"text\", \"text\": \"What is in this image?\"},\n",
                "                {\"type\": \"image_url\", \"image_url\": {\"url\": \"http://images.cocodataset.org/val2017/000000039769.jpg\"}}\n",
                "            ]\n",
                "        }\n",
                "    ],\n",
                "    temperature=0.5,\n",
                "    stream=True\n",
                ")\n",
                "\n",
                "for chunk in response:\n",
                "    content = chunk.choices[0].delta.content\n",
                "    if content:\n",
                "        print(content, end=\"\", flush=True)"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "811f1d41",
            "metadata": {},
            "source": [
                "Example Python with local image:"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "id": "8296023b",
            "metadata": {},
            "outputs": [],
            "source": [
                "# client_local_image.py\n",
                "from urllib.parse import urljoin\n",
                "import base64\n",
                "from openai import OpenAI\n",
                "\n",
                "API_KEY = \"FAKE_KEY\"\n",
                "BASE_URL = \"http://localhost:8000\"\n",
                "\n",
                "client = OpenAI(base_url=urljoin(BASE_URL, \"v1\"), api_key=API_KEY)\n",
                "\n",
                "### From an image locally saved as `example.jpg`\n",
                "# Load and encode image as base64\n",
                "with open(\"example.jpg\", \"rb\") as f:\n",
                "    img_base64 = base64.b64encode(f.read()).decode()\n",
                "\n",
                "response = client.chat.completions.create(\n",
                "    model=\"my-qwen-VL\",\n",
                "    messages=[\n",
                "        {\n",
                "            \"role\": \"user\",\n",
                "            \"content\": [\n",
                "                {\"type\": \"text\", \"text\": \"What is in this image?\"},\n",
                "                {\"type\": \"image_url\", \"image_url\": {\"url\": f\"data:image/jpeg;base64,{img_base64}\"}}\n",
                "            ]\n",
                "        }\n",
                "    ],\n",
                "    temperature=0.5,\n",
                "    stream=True\n",
                ")\n",
                "\n",
                "for chunk in response:\n",
                "    content = chunk.choices[0].delta.content\n",
                "    if content:\n",
                "        print(content, end=\"\", flush=True)"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "ccc60c1f",
            "metadata": {},
            "source": [
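                "\n",
                "The local-image example hardcodes `image/jpeg` in the data URL. For other image formats, a small helper can infer the MIME type from the file extension. The following is a minimal sketch using only the Python standard library; `to_data_url` is a hypothetical helper name, not part of Ray Serve LLM or the OpenAI client, and the JPEG fallback is an assumption.\n",
                "\n",
                "```python\n",
                "import base64\n",
                "import mimetypes\n",
                "\n",
                "\n",
                "def to_data_url(path: str) -> str:\n",
                "    \"\"\"Encode a local image file as a base64 data URL.\"\"\"\n",
                "    # Guess the MIME type from the file extension.\n",
                "    mime_type, _ = mimetypes.guess_type(path)\n",
                "    if mime_type is None:\n",
                "        # Assumption: fall back to JPEG when the extension is unknown.\n",
                "        mime_type = \"image/jpeg\"\n",
                "    # Read the raw bytes and encode them as base64 text.\n",
                "    with open(path, \"rb\") as f:\n",
                "        encoded = base64.b64encode(f.read()).decode()\n",
                "    return f\"data:{mime_type};base64,{encoded}\"\n",
                "```\n",
                "\n",
                "For example, pass `to_data_url(\"example.png\")` as the `url` value of the `image_url` content item.\n",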
                "\n",
                "---\n",
                "\n",
                "### Shutdown\n",
                "\n",
                "Shut down your LLM service:"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "id": "0ee4b879",
            "metadata": {},
            "outputs": [],
            "source": [
                "serve shutdown -y"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "a94c0307",
            "metadata": {},
            "source": [
                "\n",
                "---\n",
                "\n",
                "## Deploy to production with Anyscale services\n",
                "\n",
                "For production, use Anyscale services to deploy your Ray Serve app on a dedicated cluster without code changes. Anyscale provides scalability, fault tolerance, and load balancing, ensuring resilience against node failures, high traffic, and rolling updates. See [Deploy a small-sized LLM](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/small-size-llm/README.html#deploy-to-production-with-anyscale-services) for an example with a small-sized model like the *Qwen2.5-VL-7B-Instruct* used in this tutorial.\n",
                "\n",
                "---\n",
                "\n",
                "## Limiting images per prompt\n",
                "\n",
                "Ray Serve LLM uses [vLLM](https://docs.vllm.ai/en/stable/) as its backend engine. You can configure vLLM by passing parameters through the `engine_kwargs` section of your Serve LLM configuration. For a full list of supported options, see the [vLLM documentation](https://docs.vllm.ai/en/stable/configuration/engine_args.html#multimodalconfig).  \n",
                "\n",
                "In particular, you can limit the number of images per request by setting `limit_mm_per_prompt` in your configuration.  \n",
                "```yaml\n",
                "applications:\n",
                "- ...\n",
                "  args:\n",
                "    llm_configs:\n",
                "        ...\n",
                "        engine_kwargs:\n",
                "          ...\n",
                "          limit_mm_per_prompt: {\"image\": 3}\n",
                "```\n",
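                "\n",
                "For the Python-defined app from [Configure Ray Serve LLM](#configure-ray-serve-llm), the same limit can be sketched by adding the key to `engine_kwargs` in `LLMConfig`. This is a minimal sketch that reuses the model settings from earlier in this tutorial and assumes `engine_kwargs` is forwarded to vLLM as described above. The value `3` mirrors the YAML example and isn't a recommendation.\n",
                "\n",
                "```python\n",
                "from ray.serve.llm import LLMConfig, build_openai_app\n",
                "\n",
                "llm_config = LLMConfig(\n",
                "    model_loading_config=dict(\n",
                "        model_id=\"my-qwen-VL\",\n",
                "        model_source=\"Qwen/Qwen2.5-VL-7B-Instruct\",\n",
                "    ),\n",
                "    accelerator_type=\"L40S\",\n",
                "    engine_kwargs=dict(\n",
                "        max_model_len=8192,\n",
                "        # Allow at most 3 images per request, as in the YAML example.\n",
                "        limit_mm_per_prompt={\"image\": 3},\n",
                "    ),\n",
                ")\n",
                "\n",
                "app = build_openai_app({\"llm_configs\": [llm_config]})\n",
                "```\n",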
                "\n",
                "---\n",
                "\n",
                "## Summary\n",
                "\n",
                "In this tutorial, you deployed a vision LLM with Ray Serve LLM, from development to production. You learned how to configure Ray Serve LLM, deploy your service on your Ray cluster, and send requests with images."
            ]
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "repo_ray_docs",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "name": "python",
            "version": "3.12.11"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 5
}
